changes to address feedback #52
Can we add something about the ease of rechunking based on a virtual Zarr store?
Added a point about virtual stores simplifying rechunking or regridding.
@doug-newman-nasa yes, thank you for the reminder. I've added a line in this commit: 078b138.
| <img style="height: 150px; margin: 0px auto; display: block" alt="Simple Virtual Zarr Graphic" src="./graphics/simple-virtual-zarr.svg" /> | ||
| Virtual stores deliver a single entrypoint to a dataset comprised of many files. For NASA datasets this enables: | ||
| The performance of legacy scientific data formats is poor in a cloud environment. Cloud-optimized formats like Zarr, COG, and cloud-optimized HDF5 address this — but reprocessing or copying the entire NASA archive into these formats is not feasible. Virtual stores bridge this gap: they provide cloud-optimized access to existing archived data without copying it. |
> reprocessing or copying the entire NASA archive into these formats is not feasible
Why not? In an ideal world we would not need virtual stores, because data providers would write their data into the cloud in a suitable format in the first place. IMO we should be careful not to encourage a narrative that lift-and-shift is totally fine and okay because virtualization exists.
It's my understanding that these archival formats exist for a reason. The inclusion of metadata in each file is a self-describing feature which makes it possible to move and use files on their own. I don't think NASA is ready to shift users away or able to relinquish this archival format requirement. But I'm curious what @doug-newman-nasa would say to that.
> makes it possible to move and use files on their own.
If file download is treated as an access pattern rather than the source of truth then that need can still be served from cloud-native stores.
> I don't think NASA is ready to shift users away or able to relinquish this archival format requirement.
Yeah I'm not suggesting that NASA is at all ready for this paradigm shift today, but I am suggesting that that is the real ultimate goal, and maybe we should make that explicit.
ah ok I see, I think I can rephrase it with that goal in mind...
@TomNicholas let me know what you think of the rewording in 9adbd0a
Yes I much prefer that framing, thank you. Only nit is that in the final sentence you said that virtual stores avoid the need for "reprocessing", but I think you really mean they avoid the need for "duplicating".
With "reprocessing", I was thinking of the case where data needs to be rechunked into cloud-friendly chunk structures. But I'm fine with "duplicating the underlying data" as I think it covers the bases of either reprocessing to cloud-optimized chunks in the same or a new format.
| ## Language and ecosystem constraints | ||
| Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores. | ||
| Icechunk is written in Rust with an API in Python. Users and data providers working in other languages (Julia, R, Java, etc.) may face limited or no support for reading and writing Icechunk stores. Rust presents an organizational risk similar to what NASA has experienced with niche languages in other systems: supporting and extending Icechunk long-term would require NASA staff or contractors with Rust expertise, which is not yet widely available in the earth science community. Rust is seeing broader general adoption than some past niche languages, which reduces but does not eliminate this risk. |
I think this also captures Doug's feedback nicely. (Without naming a particular software system and its language choices 😉)
Is this the most important limitation to list, though? Feels like the points about chunk shape/size and other data product related considerations might want to get the top billing, and this could be moved down maybe? Or would that just bury the concern?
good point, I actually think it's the least important? I hadn't thought about the order signaling level of significance of the limitation but I think the new order (moving the language limitation to the bottom) reflects the ordering of significance that I would propose.
Revised explanation of legacy scientific data formats and their performance in cloud environments. Clarified the role of virtual stores in providing cloud-optimized access without data duplication.
owenlittlejohns
left a comment
Thanks for the updates @abarciauskas-bgse!